Abstract

The meaning of a word varies from one domain to another. Despite this
important domain dependence in word semantics, existing word representation
learning methods are bound to a single domain. Given a pair of source-target
domains, we propose an unsupervised method for learning domain-specific word
representations that accurately capture the domain-specific aspects of word
semantics. First, we select a subset of frequent words that occur in both
domains as pivots. Next, we optimize an objective function that enforces two
constraints: (a) for both source and target domain documents, pivots that
appear in a document must accurately predict the co-occurring non-pivots,
and (b) word representations learnt for pivots must be similar in the two
domains. Moreover, we propose a method to perform domain adaptation using
the learnt word representations. Our proposed method significantly
outperforms competitive baselines including the state-of-the-art
domain-insensitive word representations, and reports the best sentiment
classification accuracies for all domain-pairs in a benchmark dataset.

1 Introduction 

Learning semantic representations for words is a fundamental task in NLP
that is required in numerous higher-level NLP applications (Collobert et
al., 2011). Distributed word representations have gained much popularity
lately because of their accuracy as semantic representations for words
(Mikolov et al., 2013a; Pennington et al., 2014). However, the meaning of a
word often varies from one domain to another. For example, the phrase
lightweight is often used with a positive sentiment in the portable
electronics domain because a lightweight device is easier to carry around,
which is a positive attribute for a portable electronic device. In contrast,
the same phrase has a negative sentiment association in the movie domain
because movies that do not invoke deep thoughts in viewers are considered to
be lightweight (Bollegala et al., 2014). Existing word representation
learning methods are agnostic to such domain-specific semantic variations of
words, and capture semantics of words only within a single domain. To
overcome this problem and capture domain-specific semantic orientations of
words, we propose a method that learns separate distributed representations
for each domain in which a word occurs.

Despite the successful applications of distributed word representation
learning methods (Pennington et al., 2014; Collobert et al., 2011; Mikolov
et al., 2013a), most existing approaches are limited to learning only a
single representation for a given word (Reisinger and Mooney, 2010).
Although there has been some work on learning multiple prototype
representations (Huang et al., 2012; Neelakantan et al., 2014) for a word
considering its multiple senses, such methods do not consider the semantics
of the domain in which the word is being used.

If we can learn separate representations for a word for each domain in which
it occurs, we can use the learnt representations for domain adaptation tasks
such as cross-domain sentiment classification (Bollegala et al., 2011b),
cross-domain POS tagging (Schnabel and Schütze, 2013), cross-domain
dependency parsing (McClosky et al., 2010), and domain adaptation of
relation extractors (Bollegala et al., 2013a; Bollegala et al., 2013b;
Bollegala et al., 2011a; Jiang and Zhai, 2007a; Jiang and Zhai, 2007b).

We introduce the cross-domain word representation learning task, where given
two domains (referred to as the source (S) and the target (T)), the goal is
to learn two separate representations $\mathbf{w}_S$ and $\mathbf{w}_T$ for
a word w, respectively from the source and the target domain, that capture
domain-specific semantic variations of w. In this paper, we use the term
domain to represent a collection of documents related to a particular topic
such as user-reviews in Amazon for a product category (e.g. books, dvds,
movies, etc.). However, a domain in general can be a field of study (e.g.
biology, computer science, law, etc.) or even an entire source of
information (e.g. twitter, blogs, news articles, etc.). In particular, we do
not assume the availability of any labeled data for learning word
representations.


This problem setting is closely related to unsupervised domain adaptation
(Blitzer et al., 2006), which has found numerous useful applications such as
sentiment classification and POS tagging. For example, in unsupervised
cross-domain sentiment classification (Blitzer et al., 2006; Blitzer et al.,
2007), we train a binary sentiment classifier using positive and negative
labeled user reviews in the source domain, and apply the trained classifier
to predict sentiment of the target domain's user reviews. Although the
distinction between the source and the target domains is not important for
the word representation learning step, it is important for the domain
adaptation tasks in which we subsequently evaluate the learnt word
representations. Following prior work on domain adaptation (Blitzer et al.,
2006), high-frequency features (unigrams/bigrams) common to both domains are
referred to as domain-independent features or pivots. In contrast, we use
non-pivots to refer to features that are specific to a single domain.


We propose an unsupervised cross-domain word representation learning method
that jointly optimizes two criteria: (a) given a document d from the source
or the target domain, we must accurately predict the non-pivots that occur
in d using the pivots that occur in d, and (b) the source and target domain
representations we learn for pivots must be similar. The main challenge in
domain adaptation is feature mismatch, where the features that we use for
training a classifier in the source domain do not necessarily occur in the
target domain. Consequently, prior work on domain adaptation (Blitzer et
al., 2006; Pan et al., 2010) learns lower-dimensional mappings from
non-pivots to pivots, thereby overcoming the feature mismatch problem.
Criterion (a) ensures that word representations for domain-specific
non-pivots in each domain are related to the word representations for
domain-independent pivots. This relationship enables us to discover pivots
that are similar to target domain-specific non-pivots, thereby overcoming
the feature mismatch problem.

On the other hand, criterion (b) captures the prior knowledge that
high-frequency words common to two domains often represent
domain-independent semantics. For example, in sentiment classification,
words such as excellent or terrible would express similar sentiment about a
product irrespective of the domain. However, if a pivot expresses different
semantics in the source and the target domains, then it will be surrounded
by dissimilar sets of non-pivots, and this difference is reflected in the
first criterion. Criterion (b) can also be seen as a regularization
constraint imposed on word representations to prevent overfitting by
reducing the number of free parameters in the model.

Our contributions in this paper can be summarized as follows.


• We propose a distributed word representation learning method that learns
separate representations for a word for each domain in which it occurs. To
the best of our knowledge, ours is the first-ever domain-sensitive
distributed word representation learning method.


• Given domain-specific word representations, we propose a method to learn a
cross-domain sentiment classifier.


Although word representation learning methods have been used for various
related tasks in NLP such as similarity measurement (Mikolov et al., 2013c),
POS tagging (Collobert et al., 2011), dependency parsing (Socher et al.,
2011a), machine translation (Zou et al., 2013), sentiment classification
(Socher et al., 2011b), and semantic role labeling (Roth and Woodsend,
2014), to the best of our knowledge, word representation methods have not
yet been used for cross-domain sentiment classification.


Experimental results for cross-domain sentiment classification on a
benchmark dataset show that the word representations learnt using the
proposed method statistically significantly outperform a state-of-the-art
domain-insensitive word representation learning method (Pennington et al.,
2014), and several competitive baselines. In particular, our proposed
cross-domain word representation learning method is not specific to a
particular task such as sentiment classification, and in principle, can be
applied to a wide range of domain adaptation tasks. Despite this
task-independent nature of the proposed method, it achieves the best
sentiment classification accuracies on all domain-pairs, reporting
statistically comparable results to the current state-of-the-art
unsupervised cross-domain sentiment classification methods (Pan et al.,
2010; Blitzer et al., 2006).


2 Related Work 

Representing the semantics of a word using some algebraic structure such as
a vector (more generally a tensor) is a common first step in many NLP tasks
(Turney and Pantel, 2010). By applying algebraic operations on the word
representations, we can perform numerous tasks in NLP, such as composing
representations for larger textual units beyond individual words such as
phrases (Mitchell and Lapata, 2008). Moreover, word representations are
found to be useful for measuring semantic similarity, and for solving
proportional analogies (Mikolov et al., 2013c). Two main approaches for
computing word representations can be identified in prior work (Baroni et
al., 2014): counting-based and prediction-based.


In counting-based approaches (Baroni and Lenci, 2010), a word w is
represented by a vector $\mathbf{w}$ that contains other words that co-occur
with w in a corpus. Numerous methods for selecting co-occurrence contexts
such as proximity or dependency relations have been proposed (Turney and
Pantel, 2010). Despite the numerous successful applications of co-occurrence
counting-based distributional word representations, their high
dimensionality and sparsity are often problematic in practice. Consequently,
further post-processing steps such as dimensionality reduction and feature
selection are often required when using counting-based word representations.

On the other hand, prediction-based approaches first assign each word, for
example, a d-dimensional real vector, and learn the elements of those
vectors by applying them in an auxiliary task such as language modeling,
where the goal is to predict the next word in a given sequence. The
dimensionality d is fixed for all the words in the vocabulary, and, unlike
counting-based word representations, is much smaller (e.g. $d \in [10,
1000]$ in practice) compared to the vocabulary size. The neural network
language model (NNLM) (Bengio et al., 2003) uses a multi-layer feed-forward
neural network to predict the next word in a sequence, and uses
backpropagation to update the word vectors such that the prediction error is
minimized.


Although NNLMs learn word representations as a by-product, the main focus of
language modeling is to predict the next word in a sentence given the
previous words, and not learning word representations that capture
semantics. Moreover, training multi-layer neural networks using large text
corpora is time consuming. To overcome those limitations, methods that
specifically focus on learning word representations that model word
co-occurrences in large corpora have been proposed (Mikolov et al., 2013a;
Mnih and Kavukcuoglu, 2013; Huang et al., 2012; Pennington et al., 2014).
Unlike the NNLM, these methods use all the words in a contextual window in
the prediction task. Methods that use one or no hidden layers are proposed
to improve the scalability of the learning algorithms. For example, the
skip-gram model (Mikolov et al., 2013b) predicts the words c that appear in
the local context of a word w, whereas the continuous bag-of-words model
(CBOW) predicts a word w conditioned on all the words c that appear in w's
local context (Mikolov et al., 2013a). Methods that use global
co-occurrences in the entire corpus to learn word representations have been
shown to outperform methods that use only local co-occurrences (Huang et
al., 2012; Pennington et al., 2014). Overall, prediction-based methods have
been shown to outperform counting-based methods (Baroni et al., 2014).


Despite their impressive performance, existing methods for word
representation learning do not consider the semantic variation of words
across different domains. However, as described in Section 1, the meaning of
a word varies from one domain to another, and this variation must be
considered. To the best of our knowledge, the only prior work studying the
problem of word representation variation across domains is due to Bollegala
et al. (2014). Given a source and a target domain, they first select a set
of pivots using pointwise mutual information, and create two distributional
representations for each pivot using their co-occurrence contexts in a
particular domain. Next, a projection matrix from the source to the target
domain feature spaces is learnt using partial least squares regression.
Finally, the learnt projection matrix is used to find the nearest neighbors
in the source domain for each target domain-specific feature. However,
unlike our proposed method, their method does not learn domain-specific word
representations, but simply uses co-occurrence counting when creating
in-domain word representations.

Faralli et al. (2012) proposed a domain-driven word sense disambiguation
(WSD) method where they construct glossaries for several domains using a
pattern-based bootstrapping technique. This work demonstrates the importance
of considering the domain specificity of word senses. However, the focus of
their work is not to learn representations for words or their senses in a
domain, but to construct glossaries. It would be an interesting future
research direction to explore the possibility of using such domain-specific
glossaries for learning domain-specific word representations.

Neelakantan et al. (2014) proposed a method that jointly performs WSD and
word embedding learning, thereby learning multiple embeddings per word type.
In particular, the number of senses per word type is automatically
estimated. However, their method is limited to a single domain, and does not
consider how the representations vary across domains. On the other hand, our
proposed method learns a single representation for a particular word for
each domain in which it occurs.

Although in this paper we focus on the monolingual setting where source and
target domains belong to the same language, the related setting of learning
representations for words that are translational pairs across languages has
been studied (Hermann and Blunsom, 2014; Klementiev et al., 2012; Gouws et
al., 2015). Such representations are particularly useful for cross-lingual
information retrieval (Duc et al., 2010). It will be an interesting future
research direction to extend our proposed method to learn such cross-lingual
word representations.


3 Cross-Domain Representation Learning

We propose a method for learning word representations that are sensitive to
the semantic variations of words across domains. We call this problem
cross-domain word representation learning, and provide a definition in
Section 3.1. Next, in Section 3.2, given a set of pivots that occurs in both
a source and a target domain, we propose a method for learning cross-domain
word representations. We defer the discussion of pivot selection methods to
Section 3.4. In Section 3.5, we propose a method for using the learnt word
representations to train a cross-domain sentiment classifier.


3.1 Problem Definition 


Let us assume that we are given two sets of documents $\mathcal{D}_S$ and
$\mathcal{D}_T$ respectively for a source (S) and a target (T) domain. We do
not consider the problem of retrieving documents for a domain, and assume
such a collection of documents to be given. Then, given a particular word w,
we define cross-domain representation learning as the task of learning two
separate representations $\mathbf{w}_S$ and $\mathbf{w}_T$ capturing w's
semantics in respectively the source S and the target T domains.

Unlike in domain adaptation, where there is a clear distinction between the
source (i.e. the domain on which we train) vs. the target (i.e. the domain
on which we test) domains, for representation learning purposes we do not
make a distinction between the two domains. In the unsupervised setting of
the cross-domain representation learning that we study in this paper, we do
not assume the availability of labeled data for any domain for the purpose
of learning word representations. As an extrinsic evaluation task, we apply
the trained word representations for classifying sentiment related to
user-reviews (Section 3.5). However, for this evaluation task we require
sentiment-labeled user-reviews from the source domain.


Decoupling the word representation learning from any tasks in which those
representations are subsequently used simplifies the problem, and enables us
to learn task-independent word representations with potential generic
applicability. Although we limit the discussion to a pair of domains for
simplicity, the proposed method can be easily extended to jointly learn word
representations for more than two domains. In fact, prior work on
cross-domain sentiment analysis shows that incorporating multiple source
domains improves sentiment classification accuracy on a target domain
(Bollegala et al., 2011b; Glorot et al., 2011).






















3.2 Proposed Method 

To describe our proposed method, let us denote a pivot and a non-pivot
feature respectively by c and w. Our proposed method does not depend on a
specific pivot selection method, and can be used with all previously
proposed methods for selecting pivots as explained later in Section 3.4. A
pivot c is represented in the source and target domains respectively by
vectors $\mathbf{c}_S \in \mathbb{R}^n$ and $\mathbf{c}_T \in \mathbb{R}^n$.
Likewise, a source specific non-pivot w is represented by $\mathbf{w}_S$ in
the source domain, whereas a target specific non-pivot w is represented by
$\mathbf{w}_T$ in the target domain. By definition, a non-pivot occurs only
in a single domain. For notational convenience we use w to denote non-pivots
in both domains when the domain is clear from the context. We use
$\mathcal{C}_S$, $\mathcal{W}_S$, $\mathcal{C}_T$, and $\mathcal{W}_T$ to
denote the sets of word representation vectors respectively for the source
pivots, source non-pivots, target pivots, and target non-pivots.

Let us denote the set of documents in the source and the target domains
respectively by $\mathcal{D}_S$ and $\mathcal{D}_T$. Following the
bag-of-features model, we assume that a document d is represented by the set
of pivots and non-pivots that occur in d ($w \in d$ and $c \in d$). We
consider the co-occurrences of a pivot c and a non-pivot w within a
fixed-size contextual window in a document. Following prior work on
representation learning (Mikolov et al., 2013a), in our experiments, we set
the window size to 10 tokens, without crossing sentence boundaries. The
notation $(c, w) \in d$ denotes the co-occurrence of a pivot c and a
non-pivot w in a document d.
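The windowed co-occurrence extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `pivot_cooccurrences`, the symmetric window (up to `window` tokens on each side of the pivot), and the representation of a document as a list of tokenized sentences are all assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple


def pivot_cooccurrences(
    sentences: Iterable[List[str]],
    pivots: Set[str],
    non_pivots: Set[str],
    window: int = 10,
) -> List[Tuple[str, str]]:
    """Collect (pivot, non-pivot) pairs that co-occur within `window` tokens.

    Each sentence is processed separately, so a window never crosses a
    sentence boundary. The symmetric window is an assumption of this sketch.
    """
    pairs: List[Tuple[str, str]] = []
    for tokens in sentences:
        for i, c in enumerate(tokens):
            if c not in pivots:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in non_pivots:
                    pairs.append((c, tokens[j]))
    return pairs
```

For instance, a document given as `[["excellent", "lightweight", "laptop"], ["terrible", "battery"]]` with pivots `{"excellent", "terrible"}` and non-pivots `{"lightweight", "battery"}` yields one pair per sentence, and "battery" is never paired with "excellent" because the two words are in different sentences.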

We learn domain-specific word representations by maximizing the prediction
accuracy of the non-pivots w that occur in the local context of a pivot c.
The hinge loss, $L(\mathcal{C}_S, \mathcal{W}_S)$, associated with
predicting a non-pivot w in a source document $d \in \mathcal{D}_S$ that
co-occurs with pivots c is given by:

$$L(\mathcal{C}_S, \mathcal{W}_S) = \sum_{d \in \mathcal{D}_S} \sum_{(c,w) \in d} \sum_{w^* \sim p(w)} \max\left(0,\; 1 - \mathbf{c}_S^\top \mathbf{w}_S + \mathbf{c}_S^\top \mathbf{w}^*_S\right) \tag{1}$$

Here, $\mathbf{w}^*_S$ is the source domain representation of a non-pivot
$w^*$ that does not occur in d. The loss function given by Eq. 1 requires
that a non-pivot w that co-occurs with a pivot c in the document d is
assigned a higher ranking score, as measured by the inner-product between
$\mathbf{c}_S$ and $\mathbf{w}_S$, than a non-pivot $w^*$ that does not
occur in d. We randomly sample k non-pivots from the set of all source
domain non-pivots that do not occur in d as $w^*$.

Specifically, we use the marginal distribution of non-pivots p(w), estimated
from the corpus counts, as the sampling distribution. We raise p(w) to the
3/4-th power as proposed by Mikolov et al. (2013a), and normalize it to unit
probability mass prior to sampling k non-pivots $w^*$ for each
co-occurrence $(c, w) \in d$. Because the non-occurring non-pivots $w^*$
are randomly sampled, more negative samples than positive samples are
required to accurately learn a prediction model, as found in prior work on
noise contrastive estimation (Mnih and Kavukcuoglu, 2013). We experimentally
found k = 5 to be an acceptable trade-off between the prediction accuracy
and the number of training instances.
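The sampling procedure above can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions: the function name `sample_negatives` is hypothetical, sampling is done with replacement, and non-pivots that occur in the document are filtered out before sampling (the text does not say whether filtering or rejection is used). Raising raw counts to the 3/4-th power is equivalent to raising p(w) to that power, because `random.choices` normalizes the weights internally.

```python
import random
from typing import Dict, List, Optional, Set


def sample_negatives(
    counts: Dict[str, int],
    in_document: Set[str],
    k: int = 5,
    power: float = 0.75,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Sample k non-pivots that do not occur in the current document.

    `counts` maps each non-pivot of one domain to its corpus frequency.
    Sampling weights are proportional to count ** power.
    """
    rng = rng or random.Random()
    candidates = [w for w in counts if w not in in_document]
    if not candidates:
        return []
    weights = [counts[w] ** power for w in candidates]
    # random.choices samples with replacement and normalizes the weights.
    return rng.choices(candidates, weights=weights, k=k)
```

Passing a seeded `random.Random` instance makes the sampled negatives reproducible across runs.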

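Likewise, the loss function $L(\mathcal{C}_T, \mathcal{W}_T)$ for predicting
non-pivots using pivots in the target domain is given by:

$$L(\mathcal{C}_T, \mathcal{W}_T) = \sum_{d \in \mathcal{D}_T} \sum_{(c,w) \in d} \sum_{w^* \sim p(w)} \max\left(0,\; 1 - \mathbf{c}_T^\top \mathbf{w}_T + \mathbf{c}_T^\top \mathbf{w}^*_T\right) \tag{2}$$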


Here, w* denotes target domain non-pivots that 
do not occur in d, and are randomly sampled 
from p(w) following the same procedure as in the 
source domain. 

The source and target loss functions given respectively by Eqs. 1 and 2 can
be used on their own to independently learn source and target domain word
representations. However, by definition, pivots are common to both domains.
We use this property to relate the source and target word representations
via a pivot-regularizer, $R(\mathcal{C}_S, \mathcal{C}_T)$, defined as:

$$R(\mathcal{C}_S, \mathcal{C}_T) = \frac{1}{2} \sum_{i=1}^{K} \left\| \mathbf{c}_S^{(i)} - \mathbf{c}_T^{(i)} \right\|^2 \tag{3}$$

Here, $\|\mathbf{x}\|$ represents the $\ell_2$ norm of a vector
$\mathbf{x}$, and $c^{(i)}$ is the i-th pivot in a total collection of K
pivots. Word representations for non-pivots in the source and target domains
are linked via the pivot regularizer because the non-pivots in each domain
are predicted using the word representations for the pivots in each domain,
which in turn are regularized by Eq. 3. The overall objective function,
$L(\mathcal{C}_S, \mathcal{W}_S, \mathcal{C}_T, \mathcal{W}_T)$, we minimize
is the sum¹ of the source and target loss functions, regularized via Eq. 3
with coefficient $\lambda$, and is given by:

$$L(\mathcal{C}_S, \mathcal{W}_S) + L(\mathcal{C}_T, \mathcal{W}_T) + \lambda R(\mathcal{C}_S, \mathcal{C}_T) \tag{4}$$

¹Weighting the source and target loss functions by the respective dataset
sizes did not result in any significant increase in performance. We believe
that this is because the benchmark dataset contains approximately equal
numbers of documents for each domain.
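To make the structure of the overall objective concrete, the following sketch evaluates it on explicit lists of training triples. It is only an illustration: vectors are plain Python lists, the helper names (`dot`, `hinge`, `pivot_regularizer`, `objective`) are hypothetical, and the regularizer is taken to be half the sum of squared $\ell_2$ distances between the source and target vectors of each pivot, which is the form consistent with the gradients given in Section 3.3.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# One training instance: (pivot vector, co-occurring non-pivot vector,
# sampled non-occurring non-pivot vector), all from the same domain.
Triple = Tuple[Vector, Vector, Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def hinge(c: Vector, w: Vector, w_neg: Vector) -> float:
    """max(0, 1 - c.w + c.w*), the per-triple term of the hinge loss."""
    return max(0.0, 1.0 - dot(c, w) + dot(c, w_neg))


def pivot_regularizer(pivots_s: List[Vector], pivots_t: List[Vector]) -> float:
    """Half the sum of squared distances between aligned pivot vectors."""
    return 0.5 * sum(
        sum((a - b) ** 2 for a, b in zip(c_s, c_t))
        for c_s, c_t in zip(pivots_s, pivots_t)
    )


def objective(
    triples_s: List[Triple],
    triples_t: List[Triple],
    pivots_s: List[Vector],
    pivots_t: List[Vector],
    lam: float,
) -> float:
    """Source loss + target loss + lam * pivot regularizer."""
    loss_s = sum(hinge(c, w, w_neg) for c, w, w_neg in triples_s)
    loss_t = sum(hinge(c, w, w_neg) for c, w, w_neg in triples_t)
    return loss_s + loss_t + lam * pivot_regularizer(pivots_s, pivots_t)
```

Setting `lam` to zero decouples the two domains, which corresponds to learning the source and target representations independently.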

3.3 Training 

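Word representations of pivots c and non-pivots w in the source
($\mathbf{c}_S$, $\mathbf{w}_S$) and the target ($\mathbf{c}_T$,
$\mathbf{w}_T$) domains are parameters to be learnt in the proposed method.
To derive parameter updates, we compute the gradients of the overall loss
function in Eq. 4 w.r.t. each parameter as follows:

$$\frac{\partial L}{\partial \mathbf{w}_S} = \begin{cases} 0 & \text{if } \mathbf{c}_S^\top(\mathbf{w}_S - \mathbf{w}^*_S) \ge 1 \\ -\mathbf{c}_S & \text{otherwise} \end{cases} \tag{5}$$

$$\frac{\partial L}{\partial \mathbf{w}^*_S} = \begin{cases} 0 & \text{if } \mathbf{c}_S^\top(\mathbf{w}_S - \mathbf{w}^*_S) \ge 1 \\ \mathbf{c}_S & \text{otherwise} \end{cases} \tag{6}$$

$$\frac{\partial L}{\partial \mathbf{w}_T} = \begin{cases} 0 & \text{if } \mathbf{c}_T^\top(\mathbf{w}_T - \mathbf{w}^*_T) \ge 1 \\ -\mathbf{c}_T & \text{otherwise} \end{cases} \tag{7}$$

$$\frac{\partial L}{\partial \mathbf{w}^*_T} = \begin{cases} 0 & \text{if } \mathbf{c}_T^\top(\mathbf{w}_T - \mathbf{w}^*_T) \ge 1 \\ \mathbf{c}_T & \text{otherwise} \end{cases} \tag{8}$$

$$\frac{\partial L}{\partial \mathbf{c}_S} = \begin{cases} \lambda(\mathbf{c}_S - \mathbf{c}_T) & \text{if } \mathbf{c}_S^\top(\mathbf{w}_S - \mathbf{w}^*_S) \ge 1 \\ \mathbf{w}^*_S - \mathbf{w}_S + \lambda(\mathbf{c}_S - \mathbf{c}_T) & \text{otherwise} \end{cases} \tag{9}$$

$$\frac{\partial L}{\partial \mathbf{c}_T} = \begin{cases} \lambda(\mathbf{c}_T - \mathbf{c}_S) & \text{if } \mathbf{c}_T^\top(\mathbf{w}_T - \mathbf{w}^*_T) \ge 1 \\ \mathbf{w}^*_T - \mathbf{w}_T + \lambda(\mathbf{c}_T - \mathbf{c}_S) & \text{otherwise} \end{cases} \tag{10}$$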

Here, for simplicity, we drop the arguments inside the loss function and
write it as L. We use mini-batch stochastic gradient descent with a batch
size of 50 instances. AdaGrad (Duchi et al., 2011) is used to schedule the
learning rate. All word representations are initialized with n-dimensional
random vectors sampled from a zero mean and unit variance Gaussian. Although
the objective in Eq. 4 is not jointly convex in all four representations, it
is convex w.r.t. the representation of a particular feature (pivot or
non-pivot) when the representations for all the other features are held
fixed. In our experiments, the training converged in all cases in fewer than
100 epochs over the dataset.
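A single parameter update for one source-domain triple can be sketched as below. This is a simplified illustration rather than the authors' training code: it processes one instance instead of a mini-batch of 50, the function names are hypothetical, and the base learning rate `lr` and the stabilizing constant `eps` are assumed values that the text does not specify. The gradients follow from differentiating the hinge term max(0, 1 - c.w + c.w*) plus the pivot regularizer weighted by `lam`; the target-domain update is obtained by swapping the roles of the two domains.

```python
import math
from typing import List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def triple_gradients(
    c_s: List[float],
    c_t: List[float],
    w: List[float],
    w_neg: List[float],
    lam: float,
) -> Tuple[List[float], List[float], List[float]]:
    """Return gradients w.r.t. (w, w_neg, c_s) for one source triple."""
    reg = [lam * (a - b) for a, b in zip(c_s, c_t)]
    margin = dot(c_s, [x - y for x, y in zip(w, w_neg)])
    if margin >= 1.0:
        # Hinge term is inactive: only the regularizer contributes.
        return [0.0] * len(w), [0.0] * len(w_neg), reg
    g_w = [-x for x in c_s]
    g_w_neg = list(c_s)
    g_c = [n - p + r for n, p, r in zip(w_neg, w, reg)]
    return g_w, g_w_neg, g_c


def adagrad_step(
    param: List[float],
    grad: List[float],
    hist: List[float],
    lr: float = 0.1,
    eps: float = 1e-8,
) -> None:
    """In-place AdaGrad update; `hist` accumulates squared gradients."""
    for i, g in enumerate(grad):
        hist[i] += g * g
        param[i] -= lr * g / (math.sqrt(hist[i]) + eps)
```

Each parameter vector keeps its own `hist` accumulator, so frequently updated pivots receive progressively smaller steps than rarely updated non-pivots.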

The rank-based predictive hinge loss (Eq. 1) is inspired by the prior work
on word representation learning for a single domain (Collobert et al.,
2011). However, unlike the multilayer neural network in Collobert et al.
(2011), the proposed method uses a computationally efficient single layer to
reduce the number of parameters that must be learnt, thereby scaling to
large datasets. Similar to the skip-gram model (Mikolov et al., 2013a), the
proposed method predicts occurrences of contexts (non-pivots) w within a
fixed-size contextual window of a target word (pivot) c.

Scoring the co-occurrences of two words c and w by the bilinear form given
by the inner-product is similar to prior work on domain-insensitive
word-representation learning (Mnih and Hinton, 2008; Mikolov et al., 2013a).
However, unlike those methods that use the softmax function to convert
inner-products to probabilities, we directly use the inner-products without
any further transformations, thereby avoiding computationally expensive
distribution normalizations over the entire vocabulary.


3.4 Pivot Selection 

Given two sets of documents $\mathcal{D}_S$, $\mathcal{D}_T$ respectively
for the source and the target domains, we use the following procedure to
select pivots and non-pivots. First, we tokenize and lemmatize each document
using the Stanford CoreNLP toolkit². Next, we extract unigrams and bigrams
as features for representing a document. We remove features listed as stop
words using a standard stop words list. Stop word removal increases the
effective co-occurrence window size for a pivot. Finally, we remove features
that occur fewer than 50 times in the entire set of documents.

Several methods have been proposed in the prior work on domain adaptation
for selecting a set of pivots from a given pair of domains, such as the
minimum frequency of occurrence of a feature in the two domains, mutual
information (MI), and the entropy of the feature distribution over the
documents (Pan et al., 2010). In our preliminary experiments, we found that
a normalized version of the PMI (NPMI) (Bouma, 2009) works consistently well
for selecting pivots from different pairs of domains. NPMI between two
features x and y is given by:

$$\mathrm{NPMI}(x, y) = \log\left(\frac{p(x, y)}{p(x)\,p(y)}\right) \frac{1}{-\log\left(p(x, y)\right)} \tag{11}$$

Here, the joint probability p(x, y), and the marginal probabilities p(x) and
p(y) are estimated using the number of co-occurrences of x and y in the
sentences in the documents. Eq. 11 normalizes both the upper and lower
bounds of the PMI.
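As a worked illustration of normalized PMI, the sketch below computes it from raw counts using the standard definition of Bouma (2009): PMI divided by the negative log of the joint probability. The function name `npmi`, the count-based arguments, and the handling of the two degenerate cases (zero co-occurrences and a joint probability of one) are assumptions of this example, not details given in the text.

```python
import math


def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Normalized PMI of x and y estimated from sentence-level counts.

    count_xy: number of sentences containing both x and y
    count_x, count_y: number of sentences containing x (resp. y)
    total: total number of sentences
    """
    if count_xy == 0:
        # Limit value when x and y never co-occur (lower bound of NPMI).
        return -1.0
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    if p_xy >= 1.0:
        # Degenerate case (-log p_xy = 0); treated here as complete
        # co-occurrence, which is a convention chosen for this sketch.
        return 1.0
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

With these conventions the value lies in [-1, 1]: it is 1 when x and y only ever occur together, 0 when they are independent, and -1 when they never co-occur.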

²http://nlp.stanford.edu/software/corenlp.shtml































We measure the appropriateness of a feature x as a pivot according to the
score given by:

$$\mathrm{score}(x) = \min\left(\mathrm{NPMI}(x, S),\; \mathrm{NPMI}(x, T)\right) \tag{12}$$

We rank features that are common to both domains in the descending order of
their scores as given by Eq. 12, and select the top $N_P$ features as
pivots. We rank features x that occur only in the source domain by
NPMI(x, S), and select the top ranked $N_S$ features as source-specific
non-pivots. Likewise, we rank the features x that occur only in the target
domain by NPMI(x, T), and select the top ranked $N_T$ features as
target-specific non-pivots.
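The ranking procedure above can be sketched as follows, assuming the NPMI of each feature with each domain has already been computed and stored in a dictionary per domain (a feature has an entry only if it occurs in that domain). The function names and the alphabetical tie-breaking are hypothetical choices for this illustration.

```python
from typing import Dict, List


def select_pivots(
    npmi_source: Dict[str, float],
    npmi_target: Dict[str, float],
    n_pivots: int,
) -> List[str]:
    """Rank features common to both domains by min(NPMI(x,S), NPMI(x,T))."""
    common = npmi_source.keys() & npmi_target.keys()
    scores = {x: min(npmi_source[x], npmi_target[x]) for x in common}
    # Descending score; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda x: (-scores[x], x))[:n_pivots]


def select_non_pivots(
    npmi_own: Dict[str, float],
    npmi_other: Dict[str, float],
    n_non_pivots: int,
) -> List[str]:
    """Rank features that occur only in one domain by their NPMI with it."""
    only_here = [x for x in npmi_own if x not in npmi_other]
    return sorted(only_here, key=lambda x: (-npmi_own[x], x))[:n_non_pivots]
```

Taking the minimum of the two domain scores means a feature is ranked highly as a pivot only if it is strongly associated with both domains, not just one of them.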

The pivot selection criterion described here differs from that of Blitzer et
al. (2006; 2007), where pivots are defined as features that behave similarly
both in the source and the target domains. They compute the mutual
information between a feature (i.e. unigrams or bigrams) and the sentiment
labels using source domain labeled reviews. This method is useful when
selecting pivots that are closely associated with positive or negative
sentiment in the source domain. However, in unsupervised domain adaptation
we do not have labeled data for the target domain. Therefore, the pivots
selected using this approach are not guaranteed to demonstrate the same
sentiment in the target domain as in the source domain. On the other hand,
the pivot selection method proposed in this paper focuses on identifying a
subset of features that are closely associated with both domains.

It is noteworthy that our proposed cross-domain word representation learning
method (Section 3.2) does not assume any specific pivot/non-pivot selection
method. Therefore, in principle, our proposed word representation learning
method could be used with any of the previously proposed pivot selection
methods. We defer a comprehensive evaluation of possible combinations of
pivot selection methods and their effect on the proposed word representation
learning method to future work.


3.5 Cross-Domain Sentiment Classification 

As a concrete application of cross-domain word representations, we describe
a method for learning a cross-domain sentiment classifier using the word
representations learnt by the proposed method. Existing word representation
learning methods that learn from only a single domain are typically
evaluated for their accuracy in measuring semantic similarity between words,
or by solving word analogy problems. Unfortunately, such gold standard
datasets capturing cross-domain semantic variations of words are
unavailable. Therefore, by applying the learnt word representations in a
cross-domain sentiment classification task, we can conduct an indirect
extrinsic evaluation.

The training data available for unsupervised cross-domain sentiment
classification consists of unlabeled data for both the source and the target
domains as well as labeled data for the source domain. We train a binary
sentiment classifier using those training data, and apply it to classify
sentiment of the target test data.

Unsupervised cross-domain sentiment classification is challenging for two
reasons: feature-mismatch, and semantic variation. First, the sets of
features that occur in source and target domain documents are different.
Therefore, a sentiment classifier trained using source domain labeled data
is likely to encounter unseen features during test time. We refer to this as
the feature-mismatch problem. Second, some of the features that occur in
both domains will have different sentiments associated with them (e.g.
lightweight). Therefore, a sentiment classifier trained using source domain
labeled data is likely to incorrectly predict similar sentiment (as in the
source) for such features. We call this the semantic variation problem.
Next, we propose a method to overcome both problems using cross-domain word
representations.

Let us assume that we are given a set
$\{(x_S^{(i)}, y^{(i)})\}_{i=1}^{n}$ of $n$ labeled reviews
$x_S^{(i)}$ for the source domain $S$. For simplicity, let us
consider binary sentiment classification where each review
$x_S^{(i)}$ is labeled either as positive (i.e. $y^{(i)} = 1$)
or negative (i.e. $y^{(i)} = -1$). Our cross-domain binary
sentiment classification method can be easily extended to
multi-class classification. First, we lemmatize each word in a
source domain labeled review $x_S^{(i)}$, and extract unigrams
and bigrams as features to represent $x_S^{(i)}$ by a
binary-valued feature vector. Next, we train a binary linear
classifier, $\theta$, using those feature vectors. Any binary
classification algorithm can be used for this purpose. We use
$\theta(z)$ to denote the weight learnt by the classifier for a
feature $z$. In our experiments, we used $\ell_2$ regularized
logistic regression.
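The training step described above can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in the experiments: lemmatization is assumed to have been done already, and the helper names (`ngram_features`, `train_logreg`), the step size, the number of epochs, the regularization strength, and the toy reviews are all assumptions made for the example.

```python
import math


def ngram_features(tokens):
    """Binary unigram and bigram features of one (already lemmatised) review."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)


def train_logreg(docs, labels, l2=0.1, lr=0.5, epochs=200):
    """Batch gradient descent for L2-regularised logistic regression.

    docs are feature sets, labels are +1 / -1. Returns a dict that maps
    each feature z to its learnt weight theta(z).
    """
    theta = {}
    n = len(docs)
    for _ in range(epochs):
        # Gradient of the penalty (l2 / 2) * ||theta||^2.
        grad = {z: l2 * w for z, w in theta.items()}
        for feats, y in zip(docs, labels):
            margin = y * sum(theta.get(z, 0.0) for z in feats)
            # Gradient of the average logistic loss log(1 + exp(-margin)).
            coeff = -y / (1.0 + math.exp(margin)) / n
            for z in feats:
                grad[z] = grad.get(z, 0.0) + coeff
        for z, g in grad.items():
            theta[z] = theta.get(z, 0.0) - lr * g
    return theta


# Toy, made-up source reviews (hypothetical, for illustration only).
reviews = [
    ("great read highly recommend", 1),
    ("boring plot waste of time", -1),
    ("great story", 1),
    ("boring and dull", -1),
]
docs = [ngram_features(text.split()) for text, _ in reviews]
labels = [y for _, y in reviews]
theta = train_logreg(docs, labels)
print(theta["great"] > 0, theta["boring"] < 0)
```

Storing the weights in a dictionary keyed by feature mirrors the notation $\theta(z)$, and makes it easy to iterate over the features of the trained model at test time.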

At test time, we represent a test target review by a
binary-valued vector $h$ using the set of unigrams and bigrams
extracted from that review. Then, the activation score,
$\psi(h)$, of $h$ is defined by:

$$\psi(h) = \sum_{c \in h} \sum_{c' \in \theta} \theta(c') f(c'_S, c_S) + \sum_{w \in h} \sum_{w' \in \theta} \theta(w') f(w'_S, w_T) \qquad (13)$$

Here, $f$ is a similarity measure between two vectors. If
$\psi(h) > 0$, we classify $h$ as positive, and negative
otherwise. Eq. 13 measures the similarity between each feature
in $h$ against the features in the classification model
$\theta$. For pivots $c \in h$, we use the source domain
representations to measure similarity, whereas for the
(target-specific) non-pivots $w \in h$, we use their target
domain representations. We experimented with several popular
similarity measures for $f$ and found cosine similarity to
perform consistently well. We can interpret Eq. 13 as a method
for expanding a test target document using nearest neighbor
features from the source domain labeled data. It is analogous
to query expansion used in information retrieval to improve
document recall (Fang, 2008). Alternatively, Eq. 13 can be seen
as a linearly-weighted additive kernel function over two
feature spaces.

4 Experiments and Results

For training and evaluation purposes, we use the Amazon product
reviews collected by Blitzer et al. (2007) for the four product
categories: books (B), DVDs (D), electronic items (E), and
kitchen appliances (K). There are 1000 positive and 1000
negative sentiment labeled reviews for each domain. Moreover,
each domain has on average 17,547 unlabeled reviews. We use the
standard split of 800 positive and 800 negative labeled reviews
from each domain as training data, and the rest (200+200) for
testing. For validation purposes we use movie (source) and
computer (target) domains, which were also collected by Blitzer
et al. (2007), but are not part of the train/test domains.

Experiments conducted using this validation dataset revealed
that the performance of the proposed method is relatively
insensitive to the value of the regularization parameter
$\lambda \in [10^{-3}, 10^{3}]$. For the non-pivot prediction
task we generate positive and negative instances using the
procedure described in Section 3.2. As a typical example, we
have 88,494 train instances from the books source domain and
141,756 train instances from the target domain (1:5 ratio
between positive and negative instances in each domain). The
numbers of pivots and non-pivots are set to
$N_P = N_S = N_T = 500$.

In Figure 1 we compare the proposed method against two
baselines (NA, InDomain), current state-of-the-art methods for
unsupervised cross-domain sentiment classification (SFA, SCL),
word representation learning (GloVe), and cross-domain
similarity prediction (CS). The NA (no-adapt) lower baseline
uses a classifier trained on source labeled data to classify
target test data without any domain adaptation. The InDomain
baseline is trained using the labeled data for the target
domain, and simulates the performance we can expect to obtain
if target domain labeled data were available. Spectral Feature
Alignment (SFA) (Pan et al., 2010) and Structural
Correspondence Learning (SCL) (Blitzer et al., 2007) are the
state-of-the-art methods for cross-domain sentiment
classification. However, those methods do not learn word
representations.

We use Global Vector Prediction (GloVe) (Pennington et al.,
2014), the current state-of-the-art word representation
learning method, to learn word representations separately from
the source and target domain unlabeled data, and use the learnt
representations in Eq. 13 for sentiment classification. In
contrast to the joint word representations learnt by the
proposed method, GloVe simulates the level of performance we
would obtain by learning representations independently. CS
denotes the cross-domain vector prediction method proposed by
Bollegala et al. (2014). Although CS can be used to learn a
vector-space translation matrix, it does not learn word
representations. Vertical bars represent the classification
accuracies (i.e. percentage of the correctly classified test
instances) obtained by a particular method on the target
domain's test data, and Clopper-Pearson 95% binomial confidence
intervals are superimposed.
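The way learnt representations enter the classifier through the activation score of Eq. 13 can be sketched as below, with cosine similarity as the similarity measure. This is an illustrative sketch only: the function names, the decision to skip features that have no representation, and the toy two-dimensional vectors are assumptions of this example rather than details given in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def activation(h, theta, pivots, src_vec, tgt_vec):
    """Activation score of Eq. 13 for the feature set h of a target review.

    theta   : feature -> weight learnt on source labeled data
    pivots  : set of pivot features
    src_vec : feature -> source domain representation
    tgt_vec : feature -> target domain representation
    """
    score = 0.0
    for z in h:
        # Pivots use source representations, non-pivots target ones.
        vec = src_vec.get(z) if z in pivots else tgt_vec.get(z)
        if vec is None:
            continue  # assumption: skip features without a representation
        for z_prime, weight in theta.items():
            # Model features are always compared in the source space.
            if z_prime in src_vec:
                score += weight * cosine(src_vec[z_prime], vec)
    return score


# Toy, made-up weights and 2-d representations (hypothetical values).
theta = {"excellent": 1.0, "boring": -1.0}
pivots = {"excellent"}
src_vec = {"excellent": [1.0, 0.0], "boring": [-1.0, 0.0]}
tgt_vec = {"lightweight": [0.9, 0.1]}

h = {"lightweight", "excellent"}
psi = activation(h, theta, pivots, src_vec, tgt_vec)
label = "positive" if psi > 0 else "negative"
print(label)
```

In this toy setting the target-specific feature lightweight has no weight in the classifier, yet it still contributes to the score because its target representation lies close to the source representation of excellent.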

Differences in data pre-processing
(tokenization/lemmatization), selection (train/test splits),
feature representation (unigram/bigram), pivot selection
(MI/frequency), and the binary classification algorithms used
to train the final classifier make it difficult to directly
compare results published in prior work. Therefore, we re-run
the original algorithms on the same processed dataset under the
same conditions such that any differences reported in Figure 1
can be directly attributed to the domain adaptation or word
representation learning methods compared.

All methods use $\ell_2$ regularized logistic regression as the
binary sentiment classifier, and the regularization
coefficients are set to their optimal values on the validation
dataset. SFA, SCL, and CS use the same set of 500 pivots as
used by the proposed method, selected using NPMI (Section
[T4]). Dimensionality n of the representation is set to 300 for
both GloVe and the proposed method.

Figure 1: Accuracies obtained by different methods for each source-target pair in cross-domain sentiment classification.
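The Clopper-Pearson intervals superimposed on the accuracy bars can be computed along the following lines. This is a pure-Python sketch that finds the interval end points by bisection on the binomial distribution function; the function names and the count of 320 correct predictions are assumptions for illustration, while the 400 test reviews correspond to the 200+200 test split of each domain.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1.0 - p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""

    def solve(kk, target):
        # binom_cdf(kk, n, p) decreases in p, so bisect for the crossing.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if binom_cdf(kk, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower end: P(X >= k) = alpha / 2, i.e. P(X <= k - 1) = 1 - alpha / 2.
    lower = 0.0 if k == 0 else solve(k - 1, 1.0 - alpha / 2.0)
    # Upper end: P(X <= k) = alpha / 2.
    upper = 1.0 if k == n else solve(k, alpha / 2.0)
    return lower, upper


# Hypothetical example: 320 of 400 target test reviews classified correctly.
lower, upper = clopper_pearson(320, 400)
print(f"accuracy 0.800, 95% interval [{lower:.3f}, {upper:.3f}]")
```

With only 400 test instances per domain the interval spans several percentage points, which is worth keeping in mind when comparing methods whose accuracies are close.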

From Fig. 1 we see that the proposed method reports the highest
classification accuracies in all 12 domain pairs. Overall, the
improvements of the proposed method over NA, GloVe, and CS are
statistically significant, and its performance is comparable
with SFA and SCL. The proposed method's improvement over CS
shows the importance of predicting word representations instead
of counting. The improvement over GloVe shows that it is
inadequate to simply apply existing word representation
learning methods to learn independent word representations for
the source and target domains.

We must consider the correspondences between the two domains as
expressed by the pivots to jointly learn word representations.
As shown in Fig. 2, the proposed method reports superior
accuracies over GloVe across different dimensionalities.
Moreover, we see that when the dimensionality of the
representations increases, accuracies initially increase in
both methods and saturate after 200 to 600 dimensions. However,
further increasing the dimensionality results in unstable and
somewhat poor accuracies due to overfitting when training
high-dimensional representations. Although the word
representations learnt by the proposed method are not specific
to sentiment classification, the fact that the method clearly
outperforms SFA and SCL in all domain pairs is encouraging, and
implies the wider applicability of the proposed method for
domain adaptation tasks beyond sentiment classification.

Figure 2: Accuracy vs. dimensionality of the representation.

5 Conclusion 

We proposed an unsupervised method for learning cross-domain
word representations using a given set of pivots and non-pivots
selected from a source and a target domain. Moreover, we
proposed a domain adaptation method using the learnt word
representations.

Experimental results on a cross-domain sentiment classification
task showed that the proposed method outperforms several
competitive baselines and achieves the best sentiment
classification accuracies for all domain pairs. In future, we
plan to apply the proposed method to other types of domain
adaptation tasks such as cross-domain part-of-speech tagging,
named entity recognition, and relation extraction.

Source code and pre-processed data for this publication are
publicly available at www.csc.liv.ac.uk/~danushka/prj/darep.